SYS 6018 | Spring 2021 | University of Virginia
H. Diana McSpadden
When natural disasters or other emergencies result in compromised communications and transportation challenges, it can be impossible within a reasonable time to locate survivors using human-only methods. This disaster relief toy problem seeks to discover if an algorithm can effectively identify images corresponding to people who require relief.
To investigate whether this is possible a data set was provided containing RGB color values corresponding to pixels. The pixels are from high-resolution images taken by an aircraft above Haiti after the 2010 earthquake. Blue tarps had been distributed to survivors, but rescue workers did not have information about where survivors relocated after receiving the blue tarps.
Blue tarps have a distinct color when compared to other identifiable elements in Haiti. The training data set includes 63,241 RGB values for pixels in the toy problem’s images. The training data set includes a label for each pixel. The labels include:Below the training data is explored, and various modeling methods are applied to determine if, and which method can be effectively used to identify blue tarps by RGB values.
# Load Required Packages
library(tidyverse)
library(pROC)
library(randomForest)
library("GGally")
library(gridExtra)
library(plotly)
library(caret)
library(boot)
library(regclass)
library(MLeval)
library(ggplot2)
library(purrr)
library(broom)
library(e1071)
library(caret)
library(boot)filename = 'HaitiPixels.csv'
#url = 'https://collab.its.virginia.edu/access/lessonbuilder/item/1707832/group/17f014a1-d43d-4c78-a5c6-698a9643404f/Module3/HaitiPixels.csv' #this url is beng
haiti <- read_csv(filename)
print(dim(haiti))#> [1] 63241 4
Below are the first 6 rows of the training dataset.
head(haiti)#> # A tibble: 6 x 4
#> Class Red Green Blue
#> <chr> <dbl> <dbl> <dbl>
#> 1 Vegetation 64 67 50
#> 2 Vegetation 64 67 50
#> 3 Vegetation 64 66 49
#> 4 Vegetation 75 82 53
#> 5 Vegetation 74 82 54
#> 6 Vegetation 72 76 52
The dataframe contains 4 columns, and 63,241 rows. The Class column contains the correct label for the observation. Red, Green and Blue parameters are each values between 0 and 255 used in the additive RBG color model.
To prepare the data for exploratory data analysis I make Class a factor.
haiti %>%
mutate(Class = factor(Class)) #> # A tibble: 63,241 x 4
#> Class Red Green Blue
#> <fct> <dbl> <dbl> <dbl>
#> 1 Vegetation 64 67 50
#> 2 Vegetation 64 67 50
#> 3 Vegetation 64 66 49
#> 4 Vegetation 75 82 53
#> 5 Vegetation 74 82 54
#> 6 Vegetation 72 76 52
#> 7 Vegetation 71 72 51
#> 8 Vegetation 69 70 49
#> 9 Vegetation 68 70 49
#> 10 Vegetation 67 70 50
#> # ... with 63,231 more rows
Examine the numbers and percentages in each of the 5 classes:
haiti %>%
group_by(Class) %>%
summarize(N = n()) %>%
mutate(Perc = round(N / sum(N), 2) * 100)#> # A tibble: 5 x 3
#> Class N Perc
#> * <chr> <int> <dbl>
#> 1 Blue Tarp 2022 3
#> 2 Rooftop 9903 16
#> 3 Soil 20566 33
#> 4 Various Non-Tarp 4744 8
#> 5 Vegetation 26006 41
The records are not evenly distributed between the categories. Of the Classes, Blue Tarp, our “positive” category if we are thinking a binary positive/negative identification, is only 3% of our sample. Soil and Vegetation make up the majority of our sample at 74%.
It will be interesting to see performance predicting each of these categories, or a binary is or is not Blue Tarp.
After reviewing box plots for the 2-class data set, I also created two new calculated variables:
1. GBSqr = (Green + Blue)^2 * .001
2. RBSqr = (Red + Blue)^2 * .001
I created these to continue using the Red and Green values, but I wanted to increase the difference in median value difference between the positive and negative classes. There is significant interplay in color values between Red, Green, and Blue in identifying the correct shade or blue, and I want to continue using Red and Green values but increase the linear separability between the classes. The 0.01 multiplier is to return the number scale to a range similar to standard RGB values, i.e, a manual “scaling” of the new parameters.
haitiBinary = haiti %>%
mutate(ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))
haitiBinarySqrs = haiti %>%
mutate(GBSqr = I(((Green + Blue)^2) * .001), RBSqr = I(((Red + Blue)^2) * .001), ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))Re-examine the numbers and percentages in each of the 2 classes:
haitiBinary %>%
group_by(ClassBinary) %>%
summarize(N = n()) %>%
mutate(Perc = round(N / sum(N), 2) * 100)#> # A tibble: 2 x 3
#> ClassBinary N Perc
#> * <fct> <int> <dbl>
#> 1 0 61219 97
#> 2 1 2022 3
redplot <- ggplot(haiti, aes(x=Class, y=Red)) +
geom_boxplot(col='red')
greenplot <- ggplot(haiti, aes(x=Class, y=Green)) +
geom_boxplot(col='darkgreen')
blueplot <- ggplot(haiti, aes(x=Class, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplot, greenplot, blueplot)redplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Red)) +
geom_boxplot(col='red')
greenplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Green)) +
geom_boxplot(col='darkgreen')
blueplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplotB, greenplotB, blueplotB) ### How are red, blue and green values distributed between the 2 classes with the square values for Red + Blue and Green Blue?
redplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=RBSqr)) +
geom_boxplot(col='red')
greenplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=GBSqr)) +
geom_boxplot(col='darkgreen')
blueplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplotB, greenplotB, blueplotB)For the 5-class box plots:
“Blue Tarp” as the “positive” result, and other results as the “negative” result.
Regarding the box plot of the five categories, of interest is that “Soil” and “Vegetation” are relatively unique in their RGB distributions. “Rooftop” and “Various Non-Tarp” are more similar in their RBG distributions
For the 2-class box plots:
If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the ranges of red and green are much smaller for blue tarp than non-blue-tarp.
Generally, the values of red have a larger range for negative results than for positive results, and the positive results have a similar median to the negative results.
Green values have a larger range for negative results than for positive results, and the positive results have a higher median than the negative results.
There is almost no overlap in the blue data with non-blue tarps, and blue tarps.
For the 2-class box plots with the additive square values:
If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the RBSqr and GBSqr values have much less overlap than without the additive square variables.
The values of RBSqr have a larger range for negative results than for negative results, and median is significantly greater in the positive results.
GBSqr values have a larger range for negative results than for positive results. The positive results have a significantly higher median than the negative results.
There is almost no overlap in the blue data with non-blue tarps, and blue tarps.
These correlations make sense as the pixels were of highly saturated/“additive” colors. There are few pixels in the data set with low values for R,G,B.
#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haiti, progress = F)#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haitiBinary[-1], progress = F)#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggpairs(haitiBinarySqrs[-1], progress = F)#> Warning: Computation failed in `stat_density()`:
#> attempt to apply non-function
#> Warning: Computation failed in `stat_density()`:
#> attempt to apply non-function
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
The RBSqr and GBSqr have significantly less variance in their values, and better differentiation between the 2 classes than the Red and Green variables. I will be using these transformed variables in my models.
To view the relationship between the Red, Green, and Blue values between the five classes, and the binary classes, an interactive 3-D scatter plot is illustrative.
fiveCat3D = plot_ly(x=haiti$Red, y=haiti$Blue, z=haiti$Green, type="scatter3d", mode="markers", color=haiti$Class, colors = c('blue2','azure4','chocolate4','coral2','chartreuse4'),
marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))
fiveCat3D = fiveCat3D %>%
layout(title="5 Category RBG Plot", scene = list(xaxis = list(title = "Red", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "Green", color="green")))
fiveCat3D5-Class 3-D Scatter Plot Observations
One can see that there are discernible groupings of pixel categories by RGB values. Unsurprisingly, the blue tarps are higher blue values, but they do have a range of red and green values.
The 3D scatter plot is particularly useful because, by zooming in, one can see that while the ‘Blue Tarp’ values are generally distinct, there is a space in the 3D plot with mingling of “blue tarp” pixels and other pixel categories. That area of the data will provide a challenge for our model.
binary3D = plot_ly(x=haitiBinarySqrs$RBSqr, y=haitiBinarySqrs$Blue, z=haitiBinarySqrs$GBSqr, type="scatter3d", mode="markers", color=haitiBinary$ClassBinary, colors = c('red','blue2'),
marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))
binary3D = binary3D %>%
layout(title="Binary RBG Plot", scene = list(xaxis = list(title = "RBSqr", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "GBSqr", color="green")))
binary3D2-Class 3-D Scatter Plot Observations With Blue, GBSqr, and RBSqr
Similar to the five category 3D scatter plot, the binary scatter plot shows distinct groupings for blue tarp and non-blue-tarp. Visually, there is an near-distinct linear boundary between the blue tarp and non-blue tarp observations.
For logistic regression, LDA, QDA, KNN and Penalized Logistic Regression Cross-Validation threshold performance use ROC and Accuracy for tuning.
The following performance measures are collected for both the 10-fold cross-validation and the hold-out/testing/validation data:
For the Models: * No: Not a Blue Tarp is Negative * Yes: Is a Blue Tarp is Positive
In order to use R’s built-in factor function I set ClassBinary to a factor and order it “No”, “Yes”.
I also review the factor counts and create a dataframe named “train”.
levels(haitiBinarySqrs$ClassBinary)#> [1] "0" "1"
levels(haitiBinarySqrs$ClassBinary)=c("No","Yes")
levels(haitiBinarySqrs$ClassBinary)#> [1] "No" "Yes"
fct_count(haitiBinarySqrs$ClassBinary)#> # A tibble: 2 x 2
#> f n
#> <fct> <int>
#> 1 No 61219
#> 2 Yes 2022
Using 10-fold cross validation and p values in the collection (.1,.2,.3,.35,.4,.5,.6,.7,.8,.9) I tested three models:
** Model 1: Greatest Complexity:** \[ClassBinary = Blue + GBSqr + RBSqr \hspace{.3cm} | \hspace{.3cm} GBSqr = (Green + Blue)^2 \hspace{.3cm} and \hspace{.3cm} RBSqr = (Red + Blue)^2\]
** Model 2: Standard Additive Model:** \[ClassBinary = Blue + Green + Red\]
** Model 3: Simple Logistic Regression Model:** \[ClassBinary = Blue\]
library(yardstick)
set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsShuffled = haitiBinarySqrs[sample(nrow(haitiBinarySqrs)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsShuffled)),breaks=10,labels=FALSE)
# create storage variables for the p value, ROC, Specificity, and Sensitivity
k_start = 1
k_end = 17
out_nvar = k_end - k_start + 1
p_i = rep(NA, out_nvar)
# complex model measures
sensitivity_c_i = rep(NA, out_nvar)
specificity_c_i = rep(NA, out_nvar)
prec_c_i = rep(NA, out_nvar)
# additive model measures
sensitivity_a_i = rep(NA, out_nvar)
specificity_a_i = rep(NA, out_nvar)
prec_a_i = rep(NA, out_nvar)
# simple model measures
sensitivity_s_i = rep(NA, out_nvar)
specificity_s_i = rep(NA, out_nvar)
prec_s_i = rep(NA, out_nvar)
counter = 1
for (j in c(.1,.2,.3,.35,.4, .5,.6,.7,.8,.9))
{
p_i[counter] = j
accumulator_c_sens = 0
accumulator_c_spec = 0
accumulator_c_prec = 0
accumulator_a_sens = 0
accumulator_a_spec = 0
accumulator_a_prec = 0
accumulator_s_sens = 0
accumulator_s_spec = 0
accumulator_s_prec = 0
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
# define complex model
glm.fits.complex = glm(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, binomial, data = trainData)
# define additive model
glm.fits.additive = glm(ClassBinary ~ Blue+Green+Red, binomial, data = trainData)
# define simple model
glm.fits.simple = glm(ClassBinary ~ Blue, binomial, data = trainData)
# fit the complex model
glm.pred.complex = glm.fits.complex %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
if (nlevels(glm.pred.complex$.prediction) > 1)
{
accumulator_c_sens = accumulator_c_sens + yardstick::sens(glm.pred.complex, ClassBinary, .prediction)[[3]]
accumulator_c_spec = accumulator_c_spec + yardstick::spec(glm.pred.complex, ClassBinary, .prediction)[[3]]
accumulator_c_prec = accumulator_c_prec + yardstick::precision(glm.pred.complex, ClassBinary, .prediction)[[3]]
}
# fit the additive model
glm.pred.additive = glm.fits.additive %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
if (nlevels(glm.pred.additive$.prediction) > 1)
{
accumulator_a_sens = accumulator_a_sens + yardstick::sens(glm.pred.additive, ClassBinary, .prediction)[[3]]
accumulator_a_spec = accumulator_a_spec + yardstick::spec(glm.pred.additive, ClassBinary, .prediction)[[3]]
accumulator_a_prec = accumulator_a_prec + yardstick::precision(glm.pred.additive, ClassBinary, .prediction)[[3]]
}
# fit the simple model
glm.pred.simple= glm.fits.simple %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
if (nlevels(glm.pred.simple$.prediction) > 1)
{
accumulator_s_sens = accumulator_s_sens + yardstick::sens(glm.pred.simple, ClassBinary, .prediction)[[3]]
accumulator_s_spec = accumulator_s_spec + yardstick::spec(glm.pred.simple, ClassBinary, .prediction)[[3]]
accumulator_s_prec = accumulator_s_prec + yardstick::precision(glm.pred.simple, ClassBinary, .prediction)[[3]]
}
}
sensitivity_c_i[counter] = accumulator_c_sens / 10
specificity_c_i[counter] = accumulator_c_spec / 10
prec_c_i[counter] = accumulator_c_prec /10
sensitivity_a_i[counter] = accumulator_a_sens / 10
specificity_a_i[counter] = accumulator_a_spec / 10
prec_a_i[counter] = accumulator_a_prec /10
sensitivity_s_i[counter] = accumulator_s_sens / 10
specificity_s_i[counter] = accumulator_s_spec / 10
prec_s_i[counter] = accumulator_s_prec /10
counter = counter + 1
}
outcome = data.frame(p_i, sensitivity_c_i, specificity_c_i, prec_c_i, sensitivity_a_i, specificity_a_i, prec_a_i, sensitivity_s_i, specificity_s_i, prec_s_i)
print(outcome, n = nrow(outcome))#> Error in print.default(m, ..., quote = quote, right = right, max = max): invalid 'na.print' specification
SLR Model 3: ClassBinary = Blue
Model 3, with the square terms, performed better on all measures.
Because geographic distribution of blue tarps is not included in this data, I will focus on precision and specificity, i.e. the model that identifies the most blue tarps. p = 0.1 identified the most blue tarps, and had the highest precision when using 10-fold cross validation.
Below I use the entire training data set on the model with p = 0.1 to get the training Accuracy, TPR, FPR, and Precision:
set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsShuffled = haitiBinarySqrs[sample(nrow(haitiBinarySqrs)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsShuffled)),breaks=10,labels=FALSE)
out_lr.complex = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
# define complex model
glm.fits.complex = glm(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, binomial, data = trainData)
# fit the complex model
glm.pred.complex = glm.fits.complex %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < .1, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
out_lr.complex = bind_rows(out_lr.complex,
tibble(truth = testData$ClassBinary,
prediction = glm.pred.complex$.prediction,
fitted = glm.pred.complex$.fitted))
}
caret::confusionMatrix(out_lr.complex$prediction, out_lr.complex$truth)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 60987 41
#> Yes 232 1981
#>
#> Accuracy : 0.9957
#> 95% CI : (0.9951, 0.9962)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9333
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9962
#> Specificity : 0.9797
#> Pos Pred Value : 0.9993
#> Neg Pred Value : 0.8952
#> Prevalence : 0.9680
#> Detection Rate : 0.9644
#> Detection Prevalence : 0.9650
#> Balanced Accuracy : 0.9880
#>
#> 'Positive' Class : No
#>
For Logistic Regression, my calculations for Accuracy, TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
The following cross-validation measures were calculated with a threshold, p, of 0.1:library(ROCR)
# create a function so I can use repeatedly
function_ROC_AUC = function(fitted, truth, title)
{
##produce the numbers associated with classification table
rates = prediction(fitted, truth)
##store the true positive and false postive rates
roc_result = performance(rates, measure = "tpr", x.measure = "fpr")
par(mfrow=c(1,2))
##plot ROC curve and overlay the diagonal line for random guessing
plot(roc_result, main = title)
lines(x = c(0,1), y = c(0,1), col="red")
plot(roc_result, avg= "vertical", spread.estimate="boxplot", lwd=2,col='blue',
show.spread.at= seq(0.1, 0.9, by=0.1),
main= "Accuracy across cutoffs", xlab="Threshold cutoff")
##compute the AUC
auc = performance(rates, measure = "auc")
print(auc@y.values[[1]])
}
function_ROC_AUC(out_lr.complex$fitted, out_lr.complex$truth, "ROC For Tuned Complex Logisitic Regressions Model")#> [1] 0.9995643
The Logistic Regression ROC-AUC for the 10-fold cross-validated training data with p=0.1 is: 0.9996.
I trained the LDA model using 10-fold cross validation.
10-fold cross validation was used for p-values from 0.1 to .95. Tuning was performed using ROC.
I tested with thresholds in c(.1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95). The best performing threshold on accuracy was 0.3. For illustrative purposes I run with 0.3, and 0.4.
#y_true
set.seed(1976)
haitiBinarySqrsShuffled = haitiBinarySqrs[sample(nrow(haitiBinarySqrs)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsShuffled)),breaks=10,labels=FALSE)
trctrl <- trainControl(method = "cv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T)
out_lda_p = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_lda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "lda", trControl = trctrl, tuneLength=10)
# test with the fold's test data
preds = predict(train_lda, newdata = testData, type="prob")
#- evaluate fold k
y_true = testData$ClassBinary
# check the threshold
# TO DO: ADD LOOP OVER p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95
#for (p in c(.2, .8)){
for (p in c(.3, .4)){
#y_hat = tibble(hat_pred = preds$Yes > p) %>% mutate(hat_value = ifelse(preds$Yes > p))
y_hat = preds %>% mutate(hat_value = ifelse(Yes > p, "Yes", "No")) %>% select(hat_value)
acc = apply(y_hat, 2, function(y) mean(y == y_true))
#- output
out_lda_p = bind_rows(out_lda_p, tibble(threshold = p, accuracy = acc))
}
}
#out_lda_p#-- Get mean accuracy
perf_lda_p = out_lda_p %>%
group_by(threshold) %>%
summarize(mean_acc = mean(accuracy))
perf_lda_p %>% arrange(-mean_acc)#> # A tibble: 2 x 2
#> threshold mean_acc
#> <dbl> <dbl>
#> 1 0.3 0.994
#> 2 0.4 0.994
The threshold with the greatest accuracy using a LDA model uses a probability threshold of 0.3.
# tibble to save all predictions at the best threshold
out_best_lda = tibble()
out_all_preds = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_lda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "lda", trControl = trctrl, tuneLength=10)
# test with the fold's test data
preds = predict(train_lda, newdata = testData, type="prob")
#- evaluate fold k
y_true = testData$ClassBinary
y_hat = preds %>% mutate(hat_value = ifelse(Yes > 0.3, "Yes", "No")) %>% select(hat_value)
#- output
out_best_lda = bind_rows(out_best_lda, tibble(pred = y_hat$hat_value, truth = testData$ClassBinary))
out_all_preds = bind_rows(out_all_preds, tibble(Yes = preds$Yes))
}out_best_lda = out_best_lda %>% mutate(pred = factor(pred), truth = factor(truth))
caret::confusionMatrix(out_best_lda$pred, out_best_lda$truth)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 61153 284
#> Yes 66 1738
#>
#> Accuracy : 0.9945
#> 95% CI : (0.9939, 0.995)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9057
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9989
#> Specificity : 0.8595
#> Pos Pred Value : 0.9954
#> Neg Pred Value : 0.9634
#> Prevalence : 0.9680
#> Detection Rate : 0.9670
#> Detection Prevalence : 0.9715
#> Balanced Accuracy : 0.9292
#>
#> 'Positive' Class : No
#>
For LDA, my calculations for TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
function_ROC_AUC(out_all_preds$Yes, out_best_lda$truth, "ROC For Tuned LDA Model")#> [1] 0.9944908
The LDA AUC based on 10-fold cross-validated training data with a threshold of 0.3 is: 0.99.
I trained the QDA model using 10-fold cross validation.
10-fold cross validation was used and various thresholds were evaluated for best accuracy. 0.8 produced the highest accuracy. For illustrative purposes I only test thresholds of 0.7, 0.8, an 0.9.
set.seed(1976)
# I will use the same folds calculated in the LDA code#
trctrl <- trainControl(method = "cv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T)
out_qda_p = tibble()
# TO DO: ADD LOOP OVER p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95
#for (p in c(.3, .4)){
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_qda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "qda", trControl = trctrl, tuneLength=10)
# test with the fold's test data
preds = predict(train_qda, newdata = testData, type="prob")
#- evaluate fold k
for (p in c(.7, .8, .9)){
y_true = testData$ClassBinary
# check the threshold
y_hat = preds %>% mutate(hat_value = ifelse(Yes > p, "Yes", "No")) %>% select(hat_value)
acc = apply(y_hat, 2, function(y) mean(y == y_true))
#- output
out_qda_p = bind_rows(out_qda_p, tibble(threshold = p, accuracy = acc))
}
}#-- Get mean accuracy
perf_qda_p = out_qda_p %>%
group_by(threshold) %>%
summarize(mean_acc = mean(accuracy))
perf_qda_p %>% arrange(-mean_acc)#> # A tibble: 3 x 2
#> threshold mean_acc
#> <dbl> <dbl>
#> 1 0.8 0.995
#> 2 0.9 0.995
#> 3 0.7 0.995
Based on the highest Mean Accuracy of cross validation of the training data, I am selecting a threshold of 0.8 for the QDA model.
# tibble to save all predictions at the best threshold
out_best_qda = tibble()
out_all_preds_qda = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_qda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "qda", trControl = trctrl, tuneLength=10)
# test with the fold's test data
preds = predict(train_qda, newdata = testData, type="prob")
#- evaluate fold k
y_true = testData$ClassBinary
y_hat = preds %>% mutate(hat_value = ifelse(Yes > 0.8, "Yes", "No")) %>% select(hat_value)
#- output
out_best_qda = bind_rows(out_best_qda, tibble(pred = y_hat$hat_value, truth = testData$ClassBinary))
out_all_preds_qda = bind_rows(out_all_preds_qda, tibble(Yes = preds$Yes))
}out_best_qda = out_best_qda %>% mutate(pred = factor(pred), truth = factor(truth))
caret::confusionMatrix(out_best_qda$pred, out_best_qda$truth)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 61149 254
#> Yes 70 1768
#>
#> Accuracy : 0.9949
#> 95% CI : (0.9943, 0.9954)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9134
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9989
#> Specificity : 0.8744
#> Pos Pred Value : 0.9959
#> Neg Pred Value : 0.9619
#> Prevalence : 0.9680
#> Detection Rate : 0.9669
#> Detection Prevalence : 0.9709
#> Balanced Accuracy : 0.9366
#>
#> 'Positive' Class : No
#>
For QDA, my calculations for TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
function_ROC_AUC(out_all_preds_qda$Yes, out_best_qda$truth, "ROC For Tuned QDA Model")#> [1] 0.997386
The QDA AUC based on 10-fold cross-validated training data with a threshold of 0.8 is: 0.997.
After testing over ranges from 1 to 51, the best k was 27. The code used is shown below, but not evaluated.
set.seed(1976)
trControl.knn <- trainControl(method = "repeatedcv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T, p = 0.5)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trControl.knn, tuneGrid = expand.grid(k = 26:28))
knn.cv.modelset.seed(1976)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trctrl, tuneGrid = expand.grid(k = 17:35))
knn.cv.modelI order to speed up this Rmd file I am no longer evaluating the following code chunk that evaluated k = 27 - 51. When this chunk is run, the best k was still 27.
set.seed(1976)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trctrl, tuneGrid = expand.grid(k = 27:51))
knn.cv.model
From 1 - 51, the best k is 27. The tables of ROC, Sensitivity and Specificity were reviewed for each cross-validation training. From these tables one can see that the improvements within the range are in the hundredths of a percentage point of ROC, so k’s in the range of 10 - 51, appear reasonable selections for the cross-validated training data.
Running manual KNN cross validation to tune for p took significant time for my laptop to process. I ran with smaller lists of p just to get a feel for the results. I had previously ran with p = 0.5 and received accuracy of 0.996.
For processing speed time, I commented out the extending loop of threshold values, and limited the list of 0.4 and 0.5 for illustrative purposes.
library(class)
out_knn_p = tibble()
haitiBinarySqrsShuffled = haitiBinarySqrsShuffled %>% mutate(ClassBinaryTF = factor(if_else(ClassBinary == "No", FALSE, TRUE)))
#for (p in c(.1,.2,.3,.7,.8,.9))
#for (p in c(.2,.3,.4,.5,.6,.7,.8))
for (j in 1:10)
{
#Segement your data by fold using the which() function
testIndexes = which(folds==j,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_knn <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "knn", tuneGrid = data.frame(k=27))
# test with the fold's test data
preds = predict(train_knn, newdata = testData, type="prob")
for (p in c(.4,.5))
{
#- evaluate fold k
y_true = testData$ClassBinaryTF
# set the threshold to p
thres = p
y_hat = tibble(preds$Yes > thres)
acc = apply(y_hat, 2, function(y) mean(y == y_true))
#- output
out_knn_p = bind_rows(out_knn_p, tibble(threshold = p, accuracy = acc))
}
}
out_knn_p#> # A tibble: 20 x 2
#> threshold accuracy
#> <dbl> <dbl>
#> 1 0.4 0.997
#> 2 0.5 0.997
#> 3 0.4 0.996
#> 4 0.5 0.997
#> 5 0.4 0.998
#> 6 0.5 0.998
#> 7 0.4 0.995
#> 8 0.5 0.996
#> 9 0.4 0.998
#> 10 0.5 0.997
#> 11 0.4 0.997
#> 12 0.5 0.997
#> 13 0.4 0.996
#> 14 0.5 0.996
#> 15 0.4 0.997
#> 16 0.5 0.997
#> 17 0.4 0.997
#> 18 0.5 0.997
#> 19 0.4 0.997
#> 20 0.5 0.997
#-- Get mean accuracy
perf_knn_p = out_knn_p %>%
group_by(threshold) %>%
summarize(mean_acc = mean(accuracy))
perf_knn_p %>% arrange(-mean_acc)#> # A tibble: 2 x 2
#> threshold mean_acc
#> <dbl> <dbl>
#> 1 0.5 0.997
#> 2 0.4 0.997
After testing the following values of p: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, p == 0.5 is the value that produced the highest mean accuracy. This is convenient because caret’s cross-validation for KNN models default to a 0.5 threshold.
caret::confusionMatrix(knn.cv.model)#> Error in caret::confusionMatrix(knn.cv.model): object 'knn.cv.model' not found
result.knn = evalm(knn.cv.model)#> ***MLeval: Machine Learning Model Evaluation***
#> Error in evalm(knn.cv.model): object 'knn.cv.model' not found
result.knn$roc#> Error in eval(expr, envir, enclos): object 'result.knn' not found
Subset selection using glmnet/ElasticNet provides an opportunity for me to see if additional predictors are of use improving the precision of our model.
I added additional terms to the model. I selected additional additive relations between the colors because RGB color is, by design, an additive color model. The interplay between Red, Blue, and Green are intuitively significant because the combinations of these values create the visible spectrum of color.
The predictors in the ElasticNet model are:haitiBinaryFull = haitiBinarySqrs %>%
mutate(RGSqr = I(((Red + Green)^2)), RGBSqr = I(((Red + Green + Blue)^2))) head(haitiBinaryFull)#> # A tibble: 6 x 9
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr
#> <chr> <dbl> <dbl> <dbl> <I<dbl>> <I<dbl>> <fct> <I<dbl>> <I<dbl>>
#> 1 Vegetation 64 67 50 13.7 13.0 No 17161 32761
#> 2 Vegetation 64 67 50 13.7 13.0 No 17161 32761
#> 3 Vegetation 64 66 49 13.2 12.8 No 16900 32041
#> 4 Vegetation 75 82 53 18.2 16.4 No 24649 44100
#> 5 Vegetation 74 82 54 18.5 16.4 No 24336 44100
#> 6 Vegetation 72 76 52 16.4 15.4 No 21904 40000
I did not scale the predictors because the glmnet library will scale to standard deviation units. For the glmnet library, the training data must be in matrix format:
frmla = as.formula(ClassBinary~Red+Green+Blue+GBSqr+RBSqr+RGSqr+RGBSqr)
x.haitiFull.train = model.matrix(frmla, data = haitiBinaryFull)[,-1] # removing the intercept term from the formula
# switch ClassBinary to TRUE FALSE to facilitate the next bit of code
y.haitiFull.train = (haitiBinaryFull %>% mutate(ClassBinary = ifelse(ClassBinary == 'No', FALSE, TRUE)))$ClassBinaryUsing penalized logistic regression (PLR), I evaluated three different PLR methods: ridge, lasso, and elasticnet (a combination of ridge and lasso). I tuned both lambda and the probability threshold (p). Based on best testing of the threshold tuning parameter, I considered p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95.
For the sake of processing speed I commented out the loop for the extended p values (0.2 was the best threshold when I considered the extended list), and kept 0.1, 0.2, 0.3 for illustrative purposes.
After testing ridge, elasticnet, and lasso, lasso performed the best. For speed of processing, I am only demonstrating the loop with a == 1 == lasso.
library(glmnet)
set.seed(1976)
# number of folds
K = 10
# make folds
folds = rep(1:K, length=nrow(x.haitiFull.train))
out = tibble()
# LOOP FOR alpha: 0 (ridge), 0.5 (elasticnet), and 1 (lasso)
#for (a in c(0, .5, 1)){
for (a in c(1)){
# lambda may be different for the different PLR methods, so this is decided within the alpha loop
#-- Get lambda sequence so consistent over all folds
lam_seq = glmnet(x.haitiFull.train, y.haitiFull.train, family="binomial", alpha=a)$lambda
#-- Loop over K folds
for(k in unique(folds)){
#- Get train/test split for fold k
ind.train = which(folds != k)
ind.test = which(folds == k)
#- fit the alpha model over all lambdas in lam_seq
fit.model = glmnet(x.haitiFull.train[ind.train,], y.haitiFull.train[ind.train],
alpha = a, # use the alpha for the loop
family="binomial", # logistic regression
lambda = lam_seq) # all models in this alpha loop use same lambda sequence
#- make predictions on test set fold k
pred = predict(fit.model, x.haitiFull.train[ind.test, ], s=lam_seq, type = "response")
#- evaluate fold k
y_true = y.haitiFull.train[ind.test]
# TO DO: ADD LOOP OVER p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95
#for (p in c(.1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95)){
for (p in c(.1, .2, .3)){
# set the threshold to p
thres = p
y_hat = pred > thres
acc = apply(y_hat, 2, function(y) mean(y == y_true))
#- output
out = bind_rows(out,
tibble(accuracy = acc,
lambda = lam_seq,
n_train = length(ind.train),
n_test = length(ind.test),
alpha = a,
thres = p,
k = k))
}
}
}out %>% filter(accuracy == max(accuracy))#> # A tibble: 10 x 7
#> accuracy lambda n_train n_test alpha thres k
#> <dbl> <dbl> <int> <int> <dbl> <dbl> <int>
#> 1 0.997 0.0000139 56917 6324 1 0.1 4
#> 2 0.997 0.0000127 56917 6324 1 0.1 4
#> 3 0.997 0.0000563 56917 6324 1 0.2 4
#> 4 0.997 0.0000513 56917 6324 1 0.2 4
#> 5 0.997 0.0000222 56917 6324 1 0.3 10
#> 6 0.997 0.0000202 56917 6324 1 0.3 10
#> 7 0.997 0.0000184 56917 6324 1 0.3 10
#> 8 0.997 0.0000168 56917 6324 1 0.3 10
#> 9 0.997 0.0000153 56917 6324 1 0.3 10
#> 10 0.997 0.0000139 56917 6324 1 0.3 10
#-- Get mean accuracy
perf = out %>%
group_by(alpha, thres, lambda) %>%
summarize(mean_acc = mean(accuracy), se_acc = sd(accuracy)/k) #> `summarise()` has grouped output by 'alpha', 'thres', 'lambda'. You can override using the `.groups` argument.
perf %>% arrange(-mean_acc)#> # A tibble: 2,640 x 5
#> # Groups: alpha, thres, lambda [264]
#> alpha thres lambda mean_acc se_acc
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.2 0.0000563 0.996 0.000467
#> 2 1 0.2 0.0000563 0.996 0.000233
#> 3 1 0.2 0.0000563 0.996 0.000156
#> 4 1 0.2 0.0000563 0.996 0.000117
#> 5 1 0.2 0.0000563 0.996 0.0000933
#> 6 1 0.2 0.0000563 0.996 0.0000778
#> 7 1 0.2 0.0000563 0.996 0.0000667
#> 8 1 0.2 0.0000563 0.996 0.0000583
#> 9 1 0.2 0.0000563 0.996 0.0000519
#> 10 1 0.2 0.0000563 0.996 0.0000467
#> # ... with 2,630 more rows
log(5.630136e-05)#> [1] -9.784792
The plot below charts the change in mean accuracy as the log of lambda increases for each of the 10 folds. The red error bars display the difference between the (mean accuracy - standard error accuracy) and the (max accuracy + the standard error accuracy) - giving the range of mean accuracy for log lambda. The plot demonstrates that generally accuracy decreases, and the standard error of accuracy increases as log lambda increases. The highest mean accuracy is at a log lambda -9.78.
ggplot(perf, aes(log(lambda), mean_acc)) +
geom_point() + geom_line() +
geom_errorbar(aes(ymin=mean_acc-se_acc, ymax=mean_acc + se_acc),
color="red", alpha=1)From the manual cross-validation testing of accuracy the PLR lasso method with lambda of 5.630136e-05 and a threshold of 0.2 produced the greatest mean accuracy: 0.996.
out_confusion = tibble()
#-- Loop over K folds
for(k in unique(folds)){
#- Get train/test split for fold k
ind.train = which(folds != k)
ind.test = which(folds == k)
#- fit the lasso
cv.glmnet.model = glmnet(x.haitiFull.train[ind.train,], y.haitiFull.train[ind.train],
alpha = 1,
family="binomial", # logistic regression
lambda = lam_seq)
#- make predictions on test set fold k
pred = predict(cv.glmnet.model, x.haitiFull.train[ind.test, ], s=0.00005630136, type = "response")
#- evaluate fold k
y_true = y.haitiFull.train[ind.test]
# set the threshold to p
thres = 0.2
y_hat = pred > thres
#- output
out_confusion = bind_rows(out_confusion,
tibble(truth = y_true,
glmnet.fitted = y_hat))
}#> Error in glmnet(x.haitiFull.train[ind.train, ], y.haitiFull.train[ind.train], : could not find function "glmnet"
The optimal lambda:
coef(cv.glmnet.model, 0.00005630136)#> Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'coef': object 'cv.glmnet.model' not found
This resulting model is interesting. Cross-validated lasso, with p = 0.2 set the coefficients for GBSqr, RBSqr and RGBSqr to 0. This could be due to collinearity with Blue, or because the variables are truly not significant to the accuracy of the model. This did determine that the new variable RGSqr was significant.
out_confusion %>%
mutate(truth = factor(truth), glmnet.fitted = factor(glmnet.fitted)) %>%
conf_mat(truth, glmnet.fitted)#> Error in conf_mat(., truth, glmnet.fitted): could not find function "conf_mat"
assess.glmnet(pred, newy = y.haitiFull.train[ind.test], family="binomial")#> Error in assess.glmnet(pred, newy = y.haitiFull.train[ind.test], family = "binomial"): could not find function "assess.glmnet"
AUC for the lasso PLR model with a lambda of 0.00005630136 is 0.9996.
Using lasso penalized logistic regression with a threshold = 0.2, lambda = 0.00005630136:I was curious to examine the performance of the model selected by lasso in a logistic regression model with a threshold also selected by lasso (0.2). This is the selected PLR model without the lambda penalty term.
set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsFullShuffled = haitiBinaryFull[sample(nrow(haitiBinaryFull)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsFullShuffled)),breaks=10,labels=FALSE)
out_lr.full = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsFullShuffled[testIndexes, ]
trainData = haitiBinarySqrsFullShuffled[-testIndexes, ]
# define complex model
glm.fits.full = glm(ClassBinary ~ Blue+Green+Red+RGSqr, binomial, data = trainData)
# fit the glmnet lasso model
glm.pred.full = glm.fits.full %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < .2, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
out_lr.full = bind_rows(out_lr.full,
tibble(truth = testData$ClassBinary,
prediction = glm.pred.full$.prediction,
fitted = glm.pred.full$.fitted))
}
caret::confusionMatrix(out_lr.full$prediction, out_lr.full$truth)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 61039 73
#> Yes 180 1949
#>
#> Accuracy : 0.996
#> 95% CI : (0.9955, 0.9965)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.937
#>
#> Mcnemar's Test P-Value : 2.662e-11
#>
#> Sensitivity : 0.9971
#> Specificity : 0.9639
#> Pos Pred Value : 0.9988
#> Neg Pred Value : 0.9155
#> Prevalence : 0.9680
#> Detection Rate : 0.9652
#> Detection Prevalence : 0.9663
#> Balanced Accuracy : 0.9805
#>
#> 'Positive' Class : No
#>
function_ROC_AUC(out_lr.full$fitted, out_lr.full$truth, "ROC For Tuned Lasso Model")#> Error in prediction(fitted, truth): could not find function "prediction"
Using logistic regression with a threshold = 0.2 and the model selected by lasso:
It was interesting that this model performed slightly better on Precision, and FPR, but not as well based on other measures as the logistic regression model Red + Green + Blue + RBSqr + GBSqr.
haitiBinaryRF = haitiBinarySqrs %>%
mutate(RGSqr = I(((Red + Green)^2)), RGBSqr = I(((Red + Green + Blue)^2)), RG = (Red * Green), RB = (Red * Blue), GB = (Green * Blue))
head(haitiBinaryRF)#> # A tibble: 6 x 12
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr RG RB GB
#> <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <fct> <I<d> <I<db> <dbl> <dbl> <dbl>
#> 1 Vege~ 64 67 50 13.7 13.0 No 17161 32761 4288 3200 3350
#> 2 Vege~ 64 67 50 13.7 13.0 No 17161 32761 4288 3200 3350
#> 3 Vege~ 64 66 49 13.2 12.8 No 16900 32041 4224 3136 3234
#> 4 Vege~ 75 82 53 18.2 16.4 No 24649 44100 6150 3975 4346
#> 5 Vege~ 74 82 54 18.5 16.4 No 24336 44100 6068 3996 4428
#> 6 Vege~ 72 76 52 16.4 15.4 No 21904 40000 5472 3744 3952
The first random forest tuning parameter I will test for is number of trees and mtry. I will test ntree values by cross validation.
I did run the ntree loop for 250, 500, and 1000 trees. The best performing number of trees was 1000.
I tuned with a list of mtry values of 3,4,5,6,7,8,9 and 10. 4 was the best performing. In order to lessen the amount of timed needed to knit this Rmd I am not evaluating the following code block.
control = trainControl(method = 'repeatedcv', number = 10, repeats = 3, search = 'grid')
# possible mtry values
tunegrid = expand.grid(.mtry = c(3,4,5,6,7,8,9,10))
modellist = list()
#train with different ntree parameters
#for (ntree in c(250,500,1000)){
for (ntree in c(1000)){
set.seed(123)
rf.fit = train(ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB,
data = haitiBinaryRF,
method = 'rf',
metric = 'Accuracy',
tuneGrid = tunegrid,
trControl = control,
ntree = ntree)
modellist[[toString(ntree)]] = rf.fit
}
# look at the results of the cross validation with different ntree and mtry values
results = resamples(modellist)
summary(results)rf.fit$bestTunesummary(rf.fit$finalModel$ntree)as.data.frame(rf.fit$finalModel$importance) %>% arrange(MeanDecreaseGini)The above code took such a long time to run, I am not longer evaluating in my Rmd. Here are screen shots of the output to support my further evaluation of the selected random forest model:
RF Tuning Reuslts
RF MTRY Reuslts
RF Importance Results
Next, I need to test for the best probability threshold using the tuned tree with number of predictors considered at each split == 4, and number of trees = 1000. I will use accuracy for selecting the best p value.
set.seed(123)
#Randomly shuffle the data
haitiRFShuffled = haitiBinaryRF[sample(nrow(haitiBinarySqrs)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiRFShuffled)),breaks=10,labels=FALSE)
head(haitiRFShuffled)#> # A tibble: 6 x 12
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr RG RB
#> <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <fct> <I<db> <I<db> <dbl> <dbl>
#> 1 Roof~ 240 229 205 188. 198. No 219961 454276 54960 49200
#> 2 Vari~ 91 77 66 20.4 24.6 No 28224 54756 7007 6006
#> 3 Vege~ 70 69 51 14.4 14.6 No 19321 36100 4830 3570
#> 4 Vari~ 127 110 96 42.4 49.7 No 56169 110889 13970 12192
#> 5 Soil 255 255 218 224. 224. No 260100 529984 65025 55590
#> 6 Soil 255 233 169 162. 180. No 238144 431649 59415 43095
#> # ... with 1 more variable: GB <dbl>
I did run this loop with threshold 0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9 and 0.7 performed the best.
For the sake of faster compilation time, I am limited the threshold loop to 0.6, and 0.7 to illustrate the process.
# create storage variables for the p value, Accuracy, ROC, Specificity, and Sensitivity
all_out_yhat_ytrue = tibble()
out_yhat_ytrue = tibble()
out_rf = tibble()
# reset the fold accumulator tibble
out_yhat_ytrue = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segment data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiRFShuffled[testIndexes, ]
trainingData = haitiRFShuffled[-testIndexes, ]
#small training and test sets
# comment this out to get real values
#testData = testData[1:100,]
#trainingData= head(trainingData, 500)
#end of small training and test sets
# training
rf.haiti = randomForest(ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data = trainingData, mtry=4, ntree=1000)
# test
rf.preds = predict(rf.haiti, newdata = testData, type="prob")
#- evaluate fold
y_true = testData %>% mutate(ClassBinary = (ClassBinary == "Yes"), Class0or1 = ifelse(ClassBinary == FALSE,0,1))
y_true_val = y_true$ClassBinary
#y_true
#for (j in c(.2,.3,.4,.5,.6,.7,.8,.9))
for (j in c(.6,.7)) {
# set the threshold to p
y_hat = rf.preds[,2] > j
#y_hat
#- output
out_yhat_ytrue = bind_rows(out_yhat_ytrue,
tibble(thres = j, true = y_true_val, hat = y_hat, truth0or1 = y_true$Class0or1, X0 = rf.preds[,1], X1 = rf.preds[,2]))
all_out_yhat_ytrue = bind_rows(all_out_yhat_ytrue,
tibble(thres = j, true = y_true_val, hat = y_hat))
acc = sum(out_yhat_ytrue$hat == out_yhat_ytrue$true) / nrow(out_yhat_ytrue)
out_rf = bind_rows(out_rf,
tibble(accuracy = acc, thres = j))
}
}#out_rf
out_rf %>%
group_by(thres) %>%
summarize_at(vars(accuracy), list(mean_acc=mean))#> # A tibble: 2 x 2
#> thres mean_acc
#> * <dbl> <dbl>
#> 1 0.6 0.997
#> 2 0.7 0.997
Random forest cross validation with:
These tuning parameters produced the best accuracy.
Next, I produce the confusion matrix and ROC curve
# filter only the true and hat values where thres == 0.3
best_rf_df = all_out_yhat_ytrue %>%
mutate(factortrue = factor(true), factorhat = factor(hat)) %>%
filter(thres == 0.7)
head(best_rf_df)#> # A tibble: 6 x 5
#> thres true hat factortrue factorhat
#> <dbl> <lgl> <lgl> <fct> <fct>
#> 1 0.7 FALSE FALSE FALSE FALSE
#> 2 0.7 FALSE FALSE FALSE FALSE
#> 3 0.7 FALSE FALSE FALSE FALSE
#> 4 0.7 FALSE FALSE FALSE FALSE
#> 5 0.7 FALSE FALSE FALSE FALSE
#> 6 0.7 FALSE FALSE FALSE FALSE
caret::confusionMatrix(best_rf_df$factorhat, best_rf_df$factortrue)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 61174 164
#> TRUE 45 1858
#>
#> Accuracy : 0.9967
#> 95% CI : (0.9962, 0.9971)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.945
#>
#> Mcnemar's Test P-Value : 3.289e-16
#>
#> Sensitivity : 0.9993
#> Specificity : 0.9189
#> Pos Pred Value : 0.9973
#> Neg Pred Value : 0.9764
#> Prevalence : 0.9680
#> Detection Rate : 0.9673
#> Detection Prevalence : 0.9699
#> Balanced Accuracy : 0.9591
#>
#> 'Positive' Class : FALSE
#>
Produce the ROC and calculate the AUC for the tuned random forest model:
function_ROC_AUC(out_yhat_ytrue$X1, out_yhat_ytrue$truth0or1, "ROC For Tuned RF Model")#> [1] 0.9942587
Based on the 3D visualization of the dataset I generated earlier, I believe the data will separate well with a linear kernel. Using cross validation, I will tune a linear kernel with cost values that include 0.001, 0.01, 0.1, 1, 10, and 100.
I ran the svm with possible costs including 0.01, 0.1, 1, and 10 and probability thresholds .2, .3, .4, .5, .6, .7, and .8. The best performing model used cost == 1 and threshold == 0.2. For speed of knitting this Rmd I only include those values in the code below:
out_svm = tibble()
out_svm_accuracy = tibble()
#Perform 10 fold cross validation with various costs
#Create 10 equally size folds
set.seed(1976)
folds <- cut(seq(1,nrow(haitiRFShuffled)),breaks=10,labels=FALSE)
#for (cost_val in c( 0.01, 0.1, 1, 10))
for (cost_val in c(1))
{
for(i in 1:10)
{
# Use my previously generated folds
#Segement your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiRFShuffled[testIndexes, ]
trainData = haitiRFShuffled[-testIndexes, ]
svm_model = tune(svm, ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data=trainData, kernel="linear", ranges=list(cost=c(cost_val)), probability=TRUE)
pred = predict(svm_model$best.model, newdata = testData, probability=TRUE)
# to speed this up, loop through the preds after the svm has been fit.
#for (thres in c(.2,.3,.4,.5,.6,.7,.8))
for (thres in c(.2))
{
y_hat = attr(pred, "probabilities")[,"Yes"] > thres
acc = sum(y_hat == (testData$ClassBinary == "1")) / nrow(testData)
out_svm = bind_rows(out_svm,
tibble(thres = thres, cost = cost_val, pred_value = y_hat, truth_value = testData$ClassBinary, accuracy = acc, noProb = attr(pred, "probabilities")[,"No"], yesProb = attr(pred, "probabilities")[,"Yes"]))
}
}
}out_svm %>% group_by(thres, cost) %>% summarize(acc = mean(accuracy)) %>% arrange(desc(acc))#> `summarise()` has grouped output by 'thres'. You can override using the `.groups` argument.
#> # A tibble: 1 x 3
#> # Groups: thres [1]
#> thres cost acc
#> <dbl> <dbl> <dbl>
#> 1 0.2 1 0.967
The highest performing cross-validated svm used a cost of 1 and a threshold of 0.2.
best_svm = out_svm %>% filter(thres == 0.2 & cost == 1.00) %>% mutate(truth_value = (truth_value == "Yes")) %>% mutate(pred_value = factor(pred_value), truth_value = factor(truth_value))
head(best_svm)#> # A tibble: 6 x 7
#> thres cost pred_value truth_value accuracy noProb yesProb
#> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 0.2 1 FALSE FALSE 0.966 1.00 0.000000100
#> 2 0.2 1 FALSE FALSE 0.966 1.00 0.000348
#> 3 0.2 1 FALSE FALSE 0.966 1.00 0.000351
#> 4 0.2 1 FALSE FALSE 0.966 1.00 0.000103
#> 5 0.2 1 FALSE FALSE 0.966 1.00 0.000000100
#> 6 0.2 1 FALSE FALSE 0.966 1.00 0.000000100
caret::confusionMatrix(best_svm$pred_value, best_svm$truth_value)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 61063 75
#> TRUE 156 1947
#>
#> Accuracy : 0.9963
#> 95% CI : (0.9958, 0.9968)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9421
#>
#> Mcnemar's Test P-Value : 1.413e-07
#>
#> Sensitivity : 0.9975
#> Specificity : 0.9629
#> Pos Pred Value : 0.9988
#> Neg Pred Value : 0.9258
#> Prevalence : 0.9680
#> Detection Rate : 0.9656
#> Detection Prevalence : 0.9667
#> Balanced Accuracy : 0.9802
#>
#> 'Positive' Class : FALSE
#>
function_ROC_AUC(best_svm$yesProb, best_svm$truth_value, "ROC Training SVM")#> Error in prediction(fitted, truth): could not find function "prediction"
Using SVM with a threshold = 0.2 and a cost of 1.0:
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision | |
| Log Reg | N/A | 0.9996 | 0.1 | 0.9957 | 0.9802 | 0.0038 | 0.896 | |
| LDA | N/A | 0.9945 | 0.3 | 0.9945 | 0.859 | 0.001 | 0.9954 | |
| QDA | N/A | 0.997 | 0.8 | 0.9949 | 0.874 | 0.001 | 0.9959 | |
| KNN | k = 27 | 1.0 | 0.5 | 0.997 | 0.998 | 0.0313 | 0.999 | |
| Pen Log Reg | alpha = 1, lambda = 5.630136e-05 | 0.9996 | 0.2 | 0.996 | 0.959 | 0.0028 | 0.918 | |
| Random Forest | mtry = 4, ntree = 1000 | 0.994 | 0.6 | 0.997 | 0.932 | 0.0009 | 0.9706 | |
| SVM | Cost = 1.0 | 0.9995 | 0.2 | 0.9963 | 0.962 | 0.0025 | 0.9262 |
Training Data Linearly Separates, but Model is More Than Just Blue
Visually, the training data linearly separates very well. Even the 5-class, un-transformed data set was visually separable in the 3-D R-G-B visualization of “Blue Tarps” vs. the other four classes. Collapsing the classes and transforming the variables to decrease variability of Red and Green further improved the linear separability. All the models used performed with 99% accuracy with cross validation using the training data. All except the Lasso model used the following formula:
\[ ClassBinary = Blue+Green+Red+GBSqr+RBSqr\] where \(GBSqr=(Green + Blue)^2\) and \(RBSqr=(Red + Blue)^2\)
The lasso model selected the formula: \[ClassBinary = Red + Green + Blue + (Red + Green)^2\]
The functions with the additive sqare terms performed better than the simplest model (ClassBinary = Blue), and slightly better than the basic additive model (ClassBinary = Blue + Red + Green).
Even though a pixel with only a blue tarp in it is easily identifiable, pixels partially representing blue tarps require more predictors to perform better than guessing.
I look forward to discovering if the hold-out/validation data set also separates as well.
Distribution of Classes and How It Affects Model Selection
Only 3% of the observations in the training data are labeled “Blue Tarp”. This is a very unbalanced training set. This is not unexpected because it would be extremely surprising if a high percentage of land area was covered by blue tarps. In fact, if that were the case, then I would expect it would not be challenging for aid workers to find survivors because survivors would be covering a high percentage of Haiti, or Haiti would just have blue tarps everywhere and they would be of little predictive power for finding survivors.
99% accuracy, while impressive sounding, needs to be considered within context. Accuracy is 97% if the model predicted “No” for every pixel; however no blue tarps would be identified and survivors would remain undiscovered by aid workers. The “best guess” scenario, with high accuracy, provides no value for the humans in our toy problem. 99% accuracy is better than guessing, and the closer to 100% accuracy we can get, the better.
Before selecting any model, I recommend working with experienced aid workers and local Haitian residents with knowledge of the land to determine the choice between, and balance of these considerations. However, in this case, if we want the model that is least likely to send aid workers to non-blue tarps we will select the tuned Logistic Regression Model WIth Sqr Terms; additionally if we want the model that correctly identified the greatest percentage of blue tarps we will also select the tuned Logistic Regression Model with Sqr Terms.
The Logistic Regression Model with Sqr Terms, while lower in Precision, performed better on other measures which are more appropriate for this problem.
If time is of the essence, KNN Took a Long Time to Tune on my Laptop: Of all the models, tuning KNN for both the highest accuracy “number of neighbors” and highest accuracy probability threshold took the most time for my laptop to complete. In fact, on my laptop, I started the threshold tuning loop and went and did chores while periodically checking for completion. I am assuming that in a real-world scenario, each night the model would be validated and tuned with additional labeled data to improve performance. Adding to this concern, I only tested neighbors in the range 1 to 51, and with 60,000+ observations a larger k may have performed better, but the time needed to test additional k tuning parameters was too great.
If time is of the essence, which in this scenario it would be because getting predictions for locations of survivors is only the first step in making a plan for volunteers to reach the survivors, it may make sense to select a model that does not take as long to produce a result. The KNN model performed almost as well as the Logistic Regression with the complex model when considering Negative Predictive Value, and False Negative Rate; however the Logistic Regression Model performed better on these two metrics. However, the KNN model performed better at AUROC, and Precision than the Logistic Regression model, but with a significant increase in time to compute.
Very small penalty for additional parameters:
The penalty (\(\lambda\)) that performed the best using elasticnet penalized regression was very small. The conclusion I draw is that while lasso performed the best as accuracy when comparing ridge, elasticnet, and lasso, the penalties for the additional predictors was very small and with such a small ratio of number of parameters to number of observations, it wasn’t significantly useful, in this case, to shrink the model.
Request For Additional Information for Real-World Implementation
I am interested in how to use limited resources to reach the greatest number of vulnerable people. Without the GIS information for each pixel I am unable to calculate locations where the most blue tarps will be found to able to help the most people; however, there may not be a 1:1 relationship between number of tarps and number of people, so resource allocation becomes even trickier. My conclusion is that without the added GIS information for the observations I cannot provide information to help an efficient distribution of aid to those affected by the earthquake. And, demographic information regarding population density, and members-per-household would be of priceless value. Additionally, GIS information for pre-earthquake roads, and other geographic features would be useful.
The following code is not evaluated in the knitted document, but is included here for completeness.
# directory
holdout_dir = "HoldOut"
# unzip
unzip(file.path(holdout_dir,"Hold+Out+Data.zip"), exdir = holdout_dir)read_lines(file.path(holdout_dir, "orthovnir057_ROI_NON_Blue_Tarps.txt"), n_max=12)df_NotBlue_057 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir057_ROI_NON_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '0') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_NotBlue_057)
dim(df_NotBlue_057)read_lines(file.path(holdout_dir, "orthovnir067_ROI_NOT_Blue_Tarps.txt"), n_max=12)df_NotBlue_067 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir067_ROI_NOT_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '0') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_NotBlue_067)
dim(df_NotBlue_067)read_lines(file.path(holdout_dir, "orthovnir069_ROI_NOT_Blue_Tarps.txt"), n_max=12)df_NotBlue_069 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir069_ROI_NOT_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '0') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_NotBlue_069)
dim(df_NotBlue_069)read_lines(file.path(holdout_dir, "orthovnir078_ROI_NON_Blue_Tarps.txt"), n_max=12)df_NotBlue_078 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir078_ROI_NON_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '0') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_NotBlue_078)
dim(df_NotBlue_078)read_lines(file.path(holdout_dir, "orthovnir067_ROI_Blue_Tarps.txt"), n_max=12)df_Blue_067 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir067_ROI_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '1') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)#> Error in file.path(holdout_dir, "orthovnir067_ROI_Blue_Tarps.txt"): object 'holdout_dir' not found
head(df_Blue_067)#> Error in head(df_Blue_067): object 'df_Blue_067' not found
dim(df_Blue_067)#> Error in eval(expr, envir, enclos): object 'df_Blue_067' not found
File 2
read_lines(file.path(holdout_dir, "orthovnir069_ROI_Blue_Tarps.txt"), n_max=12)df_Blue_069 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir069_ROI_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '1') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_Blue_069)
dim(df_Blue_069)read_lines(file.path(holdout_dir, "orthovnir078_ROI_Blue_Tarps.txt"), n_max=12)df_Blue_078 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir078_ROI_Blue_Tarps.txt"),
skip=7, col_types = cols(';'="-", ID = "c"))) %>%
rename(Red = B1, Green = B2, Blue = B3) %>%
mutate(ClassBinary = '1') %>%
select(Lat, Lon, Red, Green, Blue, ClassBinary)
head(df_Blue_078)
dim(df_Blue_078)# join the 7 dataframes of hold out data
df_HoldOut = dplyr::bind_rows(df_NotBlue_057, df_NotBlue_067, df_NotBlue_069, df_NotBlue_078, df_Blue_067, df_Blue_069, df_Blue_078)
dim(df_HoldOut)
head(df_HoldOut)library(readr)
write_csv(df_HoldOut, "holdout.csv")Confirm we have the entire training data set for training each model with the tuned parameters
haitiAllTraining = haitiBinaryRF %>%
mutate(ClassBinary = ifelse(ClassBinary == "No","0","1")) %>%
mutate(ClassBinaryYesNo = ifelse(ClassBinary == 0, "No","Yes")) %>% mutate(ClassBinaryYesNo = factor(ClassBinaryYesNo))
head(haitiAllTraining)#> # A tibble: 6 x 13
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr RG RB GB
#> <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <chr> <I<d> <I<db> <dbl> <dbl> <dbl>
#> 1 Vege~ 64 67 50 13.7 13.0 0 17161 32761 4288 3200 3350
#> 2 Vege~ 64 67 50 13.7 13.0 0 17161 32761 4288 3200 3350
#> 3 Vege~ 64 66 49 13.2 12.8 0 16900 32041 4224 3136 3234
#> 4 Vege~ 75 82 53 18.2 16.4 0 24649 44100 6150 3975 4346
#> 5 Vege~ 74 82 54 18.5 16.4 0 24336 44100 6068 3996 4428
#> 6 Vege~ 72 76 52 16.4 15.4 0 21904 40000 5472 3744 3952
#> # ... with 1 more variable: ClassBinaryYesNo <fct>
filename_ho = 'holdout.csv'
haiti_ho <- read_csv(filename_ho)#>
#> -- Column specification --------------------------------------------------------
#> cols(
#> Lat = col_double(),
#> Lon = col_double(),
#> Red = col_double(),
#> Green = col_double(),
#> Blue = col_double(),
#> ClassBinary = col_double()
#> )
dim(haiti_ho)#> [1] 2004177 6
Next, I apply the transformations used on the testing data:
haiti_ho_full = haiti_ho %>%
mutate(GBSqr = I(((Green + Blue)^2) * .001), RBSqr = I(((Red + Blue)^2) * .001), ClassBinary = factor(ClassBinary)) %>%
mutate(RGSqr = I(((Red + Green)^2)), RGBSqr = I(((Red + Green + Blue)^2)), RG = (Red * Green), RB = (Red * Blue), GB = (Green * Blue)) %>%
mutate(ClassBinaryYesNo = ifelse(ClassBinary == 0, "No","Yes")) %>% mutate(ClassBinaryYesNo = factor(ClassBinaryYesNo)) %>%
mutate(ClassBinary = factor(ClassBinary))
head(haiti_ho_full) #> # A tibble: 6 x 14
#> Lat Lon Red Green Blue ClassBinary GBSqr RBSqr RGSqr RGBSqr RG RB
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <I<d> <I<d> <I<d> <I<db> <dbl> <dbl>
#> 1 18.5 -72.4 104 89 63 0 23.1 27.9 37249 65536 9256 6552
#> 2 18.5 -72.4 101 80 60 0 19.6 25.9 32761 58081 8080 6060
#> 3 18.5 -72.4 103 87 69 0 24.3 29.6 36100 67081 8961 7107
#> 4 18.5 -72.4 107 93 72 0 27.2 32.0 40000 73984 9951 7704
#> 5 18.5 -72.4 109 99 68 0 27.9 31.3 43264 76176 10791 7412
#> 6 18.5 -72.4 103 73 53 0 15.9 24.3 30976 52441 7519 5459
#> # ... with 2 more variables: GB <dbl>, ClassBinaryYesNo <fct>
haiti_ho_full %>%
group_by(ClassBinary) %>%
summarize(N = n()) %>%
mutate(Perc = round(N / sum(N), 2) * 100)#> # A tibble: 2 x 3
#> ClassBinary N Perc
#> * <fct> <int> <dbl>
#> 1 0 1989697 99
#> 2 1 14480 1
Our test data is only 1% Blue Tarps, which is a differente distribution from our training data.
# train the selected logistic regression model with the complete training data set
glm.all.train = glm(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, binomial, data = haitiAllTraining)#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.preds = glm.all.train %>% augment(newdata=haiti_ho_full) %>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < .1, "0", "1")) %>%
mutate(.prediction = factor(.prediction))
caret::confusionMatrix(glm.preds$.prediction, haiti_ho_full$ClassBinary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1966055 89
#> 1 23642 14391
#>
#> Accuracy : 0.9882
#> 95% CI : (0.988, 0.9883)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.5433
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9881
#> Specificity : 0.9939
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.3784
#> Prevalence : 0.9928
#> Detection Rate : 0.9810
#> Detection Prevalence : 0.9810
#> Balanced Accuracy : 0.9910
#>
#> 'Positive' Class : 0
#>
For Logistic Regression, the hold-out test results with the tuned threshold of 0.1 are:
function_ROC_AUC(glm.preds$.fitted, glm.preds$ClassBinary, "ROC Hold-out Logistic Regression")#> Error in prediction(fitted, truth): could not find function "prediction"
The AUC for hold-out the test data using a logistic regression model is 0.9997.
# non fits the model on the entire training set
trctrl.lda.selected = trainControl(method = "none", summaryFunction=twoClassSummary, classProbs=T)
lda.cv.model.selected = train(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, data = haitiAllTraining, method = "lda", trControl=trctrl.lda.selected, tuneLength = 10)lda.test = predict(lda.cv.model.selected, newdata = haiti_ho_full, type="prob")lda.preds = lda.test %>%
mutate(.prediction=ifelse(Yes < .8, "0", "1")) %>%
mutate(.prediction = factor(.prediction))
caret::confusionMatrix(lda.preds$.prediction, haiti_ho_full$ClassBinary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1986970 2081
#> 1 2727 12399
#>
#> Accuracy : 0.9976
#> 95% CI : (0.9975, 0.9977)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.8364
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9986
#> Specificity : 0.8563
#> Pos Pred Value : 0.9990
#> Neg Pred Value : 0.8197
#> Prevalence : 0.9928
#> Detection Rate : 0.9914
#> Detection Prevalence : 0.9925
#> Balanced Accuracy : 0.9275
#>
#> 'Positive' Class : 0
#>
For LDA, the hold-out test results with the tuned threshold of 0.8 are:
function_ROC_AUC(lda.test$Yes, haiti_ho_full$ClassBinary, "ROC Hold-out LDA")#> [1] 0.9966824
The AUC for hold-out the test data using an LDA is 0.997.
# fit the model on the entire training set
trctrl.qda.selected = trainControl(method = "none", summaryFunction=twoClassSummary, classProbs=T)
qda.cv.model.selected = train(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, data = haitiAllTraining, method = "qda", trControl=trctrl.qda.selected, tuneLength = 10)# test with holdout data
qda.test = predict(qda.cv.model.selected, newdata = haiti_ho_full, type="prob")
#head(qda.test)qda.preds = qda.test %>%
mutate(.prediction=ifelse(Yes < .8, "0", "1")) %>%
mutate(.prediction = factor(.prediction))
caret::confusionMatrix(qda.preds$.prediction, haiti_ho_full$ClassBinary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1979609 8727
#> 1 10088 5753
#>
#> Accuracy : 0.9906
#> 95% CI : (0.9905, 0.9907)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3748
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9949
#> Specificity : 0.3973
#> Pos Pred Value : 0.9956
#> Neg Pred Value : 0.3632
#> Prevalence : 0.9928
#> Detection Rate : 0.9877
#> Detection Prevalence : 0.9921
#> Balanced Accuracy : 0.6961
#>
#> 'Positive' Class : 0
#>
For QDA, the hold-out test results with the tuned threshold of 0.8 are:
function_ROC_AUC(qda.test$Yes, haiti_ho_full$ClassBinary, "ROC Hold-out QDA")#> [1] 0.6348543
The AUC for hold-out the test data using an QDA and a threshold of 0.8 is 0.635.
# fit the model on the entire training set
train_knn <- train(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, data = haitiAllTraining, method = "knn", tuneGrid = data.frame(k=27))
# test with the hold out data
knn.test = predict(train_knn, newdata = haiti_ho_full, type="prob")knn.preds = knn.test %>%
mutate(.prediction=ifelse(Yes < .5, "0", "1")) %>%
mutate(.prediction = factor(.prediction))
caret::confusionMatrix(knn.preds$.prediction, haiti_ho_full$ClassBinary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1975687 1858
#> 1 14010 12622
#>
#> Accuracy : 0.9921
#> 95% CI : (0.992, 0.9922)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.6104
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9930
#> Specificity : 0.8717
#> Pos Pred Value : 0.9991
#> Neg Pred Value : 0.4739
#> Prevalence : 0.9928
#> Detection Rate : 0.9858
#> Detection Prevalence : 0.9867
#> Balanced Accuracy : 0.9323
#>
#> 'Positive' Class : 0
#>
For QDA, the hold-out test results with the tuned threshold of 0.8 are:
function_ROC_AUC(knn.test$Yes, haiti_ho_full$ClassBinary, "ROC Hold-out KNN")#> [1] 0.9541119
The AUC for hold-out the test data using an KNN | k = 27 is 0.954.
I already have the matrix for the training data to retrain the model with the complete data set. However, I do need to create the testing matrix.
I tried using the lambda from the tuning, and it did not produce convergence on the test data.
x.haitiHoldout.test = model.matrix(frmla, data = haiti_ho_full)[,-1] # removing the intercept term from the formula
# switch ClassBinary to TRUE FALSE to facilitate the next bit of code
y.haitiHoldout.test = (haiti_ho_full %>% mutate(ClassBinary = ifelse(ClassBinary == '0', FALSE, TRUE)))$ClassBinary
#-- Get lambda sequence so consistent over all folds
lam_seq = glmnet(x.haitiFull.train, y.haitiFull.train, family="binomial", alpha=a)$lambdaTraining on entire training data set, and predict with the hold out data. Evaluation of the testing data is performed with the tuned lambda and the best threshold based on accuracy.
#- fit the lasso with all the training data and teh best lambda
lasso_holdout_model = glmnet(x.haitiFull.train, y.haitiFull.train, alpha = 1,
family="binomial")
# make test predictions
lasso.test = predict(lasso_holdout_model, x.haitiHoldout.test, s = 5.630136e-05, type = "response")
lasso_out_confusion = tibble()
#- evaluate fold k
y_true = y.haitiHoldout.test
# set the threshold to p
thres = 0.2
y_hat = lasso.test > thres
#- output
lasso_out_confusion = bind_rows(lasso_out_confusion,
tibble(truth = y.haitiHoldout.test,
fitted = y_hat))
lasso_out_confusion = lasso_out_confusion %>% mutate(truth = factor(truth), fitted = factor(fitted))
caret::confusionMatrix(lasso_out_confusion$fitted, lasso_out_confusion$truth)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 1974381 100
#> TRUE 15316 14380
#>
#> Accuracy : 0.9923
#> 95% CI : (0.9922, 0.9924)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.6476
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9923
#> Specificity : 0.9931
#> Pos Pred Value : 0.9999
#> Neg Pred Value : 0.4842
#> Prevalence : 0.9928
#> Detection Rate : 0.9851
#> Detection Prevalence : 0.9852
#> Balanced Accuracy : 0.9927
#>
#> 'Positive' Class : FALSE
#>
For QDA, the hold-out test results with the tuned threshold of 0.2 are:
function_ROC_AUC(lasso.test, haiti_ho_full$ClassBinary, "ROC Hold-out Lasso")#> [1] 0.9997257
The AUC for hold-out the test data using an lasso with tuned lambda and threshold of 0.2 is 0.9997.
haitiAllTraining = haitiAllTraining %>% mutate(ClassBinary = factor(ClassBinary))
head(haitiAllTraining)#> # A tibble: 6 x 13
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr RG RB GB
#> <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <fct> <I<d> <I<db> <dbl> <dbl> <dbl>
#> 1 Vege~ 64 67 50 13.7 13.0 0 17161 32761 4288 3200 3350
#> 2 Vege~ 64 67 50 13.7 13.0 0 17161 32761 4288 3200 3350
#> 3 Vege~ 64 66 49 13.2 12.8 0 16900 32041 4224 3136 3234
#> 4 Vege~ 75 82 53 18.2 16.4 0 24649 44100 6150 3975 4346
#> 5 Vege~ 74 82 54 18.5 16.4 0 24336 44100 6068 3996 4428
#> 6 Vege~ 72 76 52 16.4 15.4 0 21904 40000 5472 3744 3952
#> # ... with 1 more variable: ClassBinaryYesNo <fct>
#- fit the random forest with all the training data use the best tuning
# training
rf.haiti_holdout = randomForest(ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data = haitiAllTraining, mtry=4, ntree=1000)
# test
rf.preds_holdout = predict(rf.haiti_holdout, newdata = haiti_ho_full, type="prob")
y_true = haiti_ho_full %>% mutate(ClassBinary = (ClassBinary == "1"))
y_true_val = y_true$ClassBinary
head(y_true_val)#> [1] FALSE FALSE FALSE FALSE FALSE FALSE
#y_true = haiti_ho_full$ClassBinary
#unique(y_true)
# set the threshold to p
y_hat = rf.preds_holdout[,2] > 0.6
head(y_hat)#> 1 2 3 4 5 6
#> FALSE FALSE FALSE FALSE FALSE FALSE
out_yhat_ytrue = tibble()
#- output
out_yhat_ytrue = bind_rows(out_yhat_ytrue,
tibble(thres = j, true = y_true_val, hat = y_hat, X0 = rf.preds_holdout[,1], X1 = rf.preds_holdout[,2]))
out_yhat_ytrue = out_yhat_ytrue %>% mutate(true = factor(true), hat = factor(hat))caret::confusionMatrix(out_yhat_ytrue$hat, out_yhat_ytrue$true)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 1982836 6219
#> TRUE 6861 8261
#>
#> Accuracy : 0.9935
#> 95% CI : (0.9934, 0.9936)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.5549
#>
#> Mcnemar's Test P-Value : 2.086e-08
#>
#> Sensitivity : 0.9966
#> Specificity : 0.5705
#> Pos Pred Value : 0.9969
#> Neg Pred Value : 0.5463
#> Prevalence : 0.9928
#> Detection Rate : 0.9894
#> Detection Prevalence : 0.9925
#> Balanced Accuracy : 0.7835
#>
#> 'Positive' Class : FALSE
#>
For Random Forest using 1000 trees, and 4 predictors at each split, the hold-out test results with the tuned threshold of 0.6 are:
#head(out_yhat_ytrue)
function_ROC_AUC(out_yhat_ytrue$X1, out_yhat_ytrue$true, "ROC Hold-out Random Forest")#> [1] 0.976642
The AUC for the testing data on the tuned Random Forest model is 0.977.
svm_model_holdout = tune(svm, ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data=haitiAllTraining, kernel="linear", ranges=list(cost=c(1)), probability=TRUE)
pred = predict(svm_model_holdout$best.model, newdata = haiti_ho_full, probability=TRUE)
y_hat = attr(pred, "probabilities")[,"1"] > 0.2
out_svm_holdout = tibble()
# use y_true_val from random forest
out_svm_holdout = bind_rows(out_svm_holdout,
tibble(pred_value = y_hat, truth_value = y_true$ClassBinary, noProb = attr(pred, "probabilities")[,"0"], yesProb = attr(pred, "probabilities")[,"1"]))
out_svm_holdout = out_svm_holdout %>% mutate(pred_value = factor(pred_value), truth_value = factor(truth_value))
#head(out_svm_holdout)caret::confusionMatrix(out_svm_holdout$pred_value, out_svm_holdout$truth_value)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 1968406 98
#> TRUE 21291 14382
#>
#> Accuracy : 0.9893
#> 95% CI : (0.9892, 0.9895)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.5691
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.9893
#> Specificity : 0.9932
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.4032
#> Prevalence : 0.9928
#> Detection Rate : 0.9822
#> Detection Prevalence : 0.9822
#> Balanced Accuracy : 0.9913
#>
#> 'Positive' Class : FALSE
#>
For SVM using a linear kernel, with a cost == 1, the hold-out test results with the tuned threshold of 0.2 are:
function_ROC_AUC(out_svm_holdout$yesProb, out_svm_holdout$truth_value, "ROC Hold-out SVM")#> [1] 0.999491
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
| Log Reg | N/A | 0.9998 | 0.1 | 0.9882 | 0.994 | 0.012 | 0.38 |
| LDA | N/A | 0.997 | 0.3 | 0.9976 | 0.856 | 0.001 | 0.8197 |
| QDA | N/A | 0.635 | 0.8 | 0.9906 | 0.397 | 0.005 | 0.3632 |
| KNN | k = 27 | 0.954 | 0.5 | 0.9921 | 0.872 | 0.007 | 0.4739 |
| Penalized Log Reg | lamda = 5.630136e-05 | 0.9997 | 0.2 | 0.9923 | 0.993 | 0.008 | 0.484 |
| Random Forest | ntree = 1000, mtry = 4 | 0.977 | 0.6 | 0.9937 | 0.588 | 0.003 | 0.5578 |
| SVM | linear kernel, cost == 1 | 0.999 | 0.2 | 0.9895 | 0.993 | 0.01 | 0.41 |